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Abstract 

Nonparametric  density  estimation  requires  the  specification  of  smoothing  parameters.  The 
demands  of  statistical  objectivity  make  it  highly  desirable  to  base  the  choice  on  properties  of  the 
data  set.  In  this  paper  we  introduce  some  biased  cross-validation  criteria  for  selection  of  smooth¬ 
ing  parameters  for  kernel  and  histogram  density  estimators.  These  criteria  are  obtained  by 
estimating  L  2-norms  of  derivatives  of  the  unknown  density  and  provide  slightly  biased  estimates 
of  the  average  squared-L2  error  or  mean  integrated  squared  error.  These  criteria  are  roughly  the 
analog  of  the  generalized  cross-validation  procedure  for  orthogonal  series  density  estimators.  We 
present  the  relationship  of  the  biased  cross-validation  procedure  to  the  least  squares  cross- 
validation  procedure,  which  provides  unbiased  estimates  of  the  mean  integrated  squared  error. 
Both  methods  are  shown  to  be  based  on  U  -statistics.  We  compare  the  two  methods  by  theoreti¬ 
cal  calculation  of  the  noise  in  the  cross-validation  fimctions  and  corresponding  cross-validated 
smoothing  parameters,  by  Monte  Carlo  simulation,  and  by  example.  Surprisingly  large  gains  in 
asymptotic  efficiency  are  observed  between  biased  and  iinbiased  cross-validation  when  the  under¬ 
lying  density  is  sufficiently  smooth.  Reliability  of  cross-validation  for  finite  samples  is  discussed. 
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1*  Introduction 

1,1.  Motivation  and  Scope  of  Paper 

Much  theoretical  progress  has  been  made  recently  with  the  important  problem  of  data-based 
methods  for  choosing  smoothing  parameters  in  nonparametric  curve  estimation  procedures  since 
the  early  work  of  Kronmal  and  Tarter  (1968),  Woodroofe  (1970),  and  Stone  (1974).  In  density 
estimation  particular  attention  has  been  paid  to  the  least-squares  cross-validation  algorithm 
described  independently  by  Rudemo  (1982)  and  Bowman  (1984).  The  sequence  of  smoothing 
parameters  produced  by  this  procedure  not  only  leads  to  consistent  density  estimates  but  is 
asymptotically  optimal  in  a  certain  sense,  as  shown  by  Hall  (1983)  and  Stone  (1984).  Recently 
Hall  and  Marron  (1985)  have  characterized  the  limiting  distribution  of  this  sequence.  This  theory 
indicates  that  these  cross-validation  [CY)  sequences  converge  at  perhaps  surprisingly  slow  rates. 
As  was  the  case  with  the  original  kernel  theory  of  Rosenblatt  and  Parzen,  the  new  theory  is 
asymptotic  in  nature,  so  that  considerable  effort  will  be  required  to  fully  imderstand  the  practical 
aspects  of  these  methods  and  their  performance  with  real  data.  For  samples  of  size  under  100 
with  Gaussian  kernel  estimates,  two  simulation  studies  have  been  completed.  First,  Scott  and 
Factor  (1981)  showed  that  the  average  behavior  of  some  earlier  CV  algorithms  was  good  for 
Gaussian  data  in  the  sense  that  the  cross-validation  smoothing  parameters  were  centered  around 
the  value  predicted  by  minimizing  mean  integrated  squared  error  {MISE).  Second,  Bowman 
(1985)  presented  a  study  using  6  sampling  densities  and  8  cross-validation  algorithms.  We  are 
unaware  of  studies  involving  much  larger  samples. 

The  goal  of  cross-validation  is  to  automatically  provide  nearly  optimally  calibrated  non¬ 
parametric  estimates,  mimicking  the  choices  of  experts  and  perhaps  surpassing  them.  Consistency 
of  cross-validation  algorithms  is  important  but  we  are  more  concerned  with  understanding  small- 
sample  reliability,  which  we  define  as  the  smallest  sample  size  for  there  is  a  90%  chance  of  being 
within  10-15%  of  the  optimal  smoothing  parameter.  This  is  a  useful  rule  of  thumb  because  even 
for  extremely  large  samples,  density  estimates  with  smoothing  parameters  outside  this  narrow 
range  are  either  distorted  or  visually  noisy.  Highly  reliable  cross-validation  algorithms  would  pro- 
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vide  scientific  reproducibility  of  density  estimates  between  laboratories,  an  important  but  elusive 
goal.  Our  objectives  are  similar  to  those  of  researchers  trying  to  retain  the  reproducibility  of  mul¬ 
tiple  linear  regression  while  introducing  transformations  and  robust  methods  via  artificial  intelli¬ 
gence  (Gale  and  Pregabon  1983).  Carroll  and  Ruppert  (1985)  have  expressed  caution  about  using 
robust  methods  “blindly.”  We  will  show  that  similar  caution  is  appropriate  for  cross-validation  of 
density  estimators. 

In  this  paper  we  introduce  a  (biased)  cross-validation  method  closely  related  to  one  investi¬ 
gated  in  Scott  and  Factor  (1981),  and  compare  it  with  the  least-squares  unbiased  cross-validation 
algorithm.  We  remark  that  our  biased  cross-validation  algorithm  is  really  a  generalized  rather 
than  ordinary  cross-validation  method  (Wahba  1977).  Both  theoretical  results  and  simulations 
indicate  that  the  new  algorithm  compares  favorably  to  the  least-squares  cross-validation  algorithm 
in  many  situations.  The  theoretical  results  explain  some  of  the  small  sample  behavior  of  several 
cross-validation  functions.  We  also  show  that  cross-validation  algorithms  can  be  unreliable  for 
samples  sizes  which  are  “too  small.”  Our  purpose  is  to  present  material  that  will  aid  the  practi¬ 
tioner  in  the  use  of  these  appealing  automatic  cross-validation  algorithms  and  to  help  facilitate 
evaluation  of  future  algorithms.  In  order  to  do  this  we  must  address  some  ofttimes  controversial 
issues  in  density  estimation:  squared  loss,  the  integrate  squared  error  and  mean  integrated  squared 
error  criteria,  adaptive  density  estimates,  sample  size  requirements,  and  assumptions  about  the 
underlying  density’s  smoothness.  Our  simulations  include  samples  of  up  to  25,600  points. 

There  is  a  rich  literature  on  data-based  smoothing  algorithms  for  nonparametric  methods. 
A  survey  of  smoothing  methods  for  density  estimation  may  be  found  in  Scott  (1986).  A  more  gen¬ 
eral  survey  was  given  by  Titterington  (1985).  In  our  discussion  we  shall  focus  on  density  estima¬ 
tion,  although  the  situation  for  nonparametric  regression  is  parallel  (Rice  1984;  Hardle  and  Mar- 
ron  1985).  With  regression,  one  must  pay  attention  to  the  interactions  among  choices  of  the 
regression  curve,  the  signal- to-noise  ratio,  and  the  distribution  of  the  noise,  whereas  we  only  need 
consider  the  density  curve  here. 
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At  this  point  it  is  helpful  to  give  a  partial  outline  of  the  paper.  It  would  be  natural  at  a 
first  reading  to  proceed  to  Sections  5  and  6  after  reading  Sections  1.2  and  2  and  only  glancing  at 
theoretical  results  in  Sections  3  and  4.  All  of  the  proofs  in  the  paper  are  presented  in  Section  10. 

1.  Introduction 

1.1.  Motivation  and  Scope  of  Paper 

1.2.  Example 

2.  Asymptotic  Mean  Integrated  Squared  Error  Theory 

3.  Cross-Validation  Algorithms  and  Theory 

3.1.  Least-Squares  (Unbiased)  Cross-Validation 

3.2.  Biased  Cross-Validation 

3.3.  Unbiased  Cross-Validation  Revisited 

3.3.1.  Augmented  Unbiased  Cross-Validation  Criterion 

3.3.2.  Asymptotic  Relative  “Vertical”  Variability 

4.  Variability  and  Asymptotic  Normality  of  Cross-Validation  Smoothing  Parameters 

4.1.  Unbiased  Cross-Validation 

4.2.  Biased  Cross-Validation 

4.3.  Asymptotic  Relative  “Horizontal”  Variability 

5.  Implementation  with  Gaussian  Kernel  and  Averaged  Shifted  Histogram  Estimates 

5.1.  Two  Introductory  Examples 

5.2,  Averaged  Shifted  Histogram  Implementation 

6.  Monte  Carlo  Study 

6.1  Small-to-Large  Sample  Behavior  with  Gaussian  Data 

6.2  Other  Densities 

7.  Some  Issues  in  Cross-Validation 

8.  Some  More  Examples 

9.  Discussion,  Conclusions,  and  other  Recent  Work 

10.  Proofs 

A  few  notations  are  used  extensively  throughout  the  paper.  We  shall  denote  the  squared  L2 
norm  of  a  function  ip  by 


00 

1 1  1 1  2^=  J  i>{xfdx  ,  (1.1) 

-00 

where  R  reminds  us  that  (1.1)  is  one  possible  measure  of  the  roughness  of  ip.  The  square  of  the 
p  -th  derivative  of  ip  will  be  denoted  by  ip^^\x  )^.  Integrals  without  limits  are  assumed  to  be  over 
the  entire  real  line. 


1.2.  Example 

We  begin  by  presenting  an  example  using  two  cross-validation  functions  for  the  histogram. 
For  an  equally  spaced  histogram  of  a  random  sample  of  size  n  with  bin  width  h  ,  let  [k )  be  the 
bin  count  in  the  ^-th  bin  [kh  ,{k  -{-1)A  ),  where  without  loss  of  generality  we  may  assume  the  mesh 
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includes  zero.  In  Section  3.1,  we  show  that  the  least-squares  cross-validation  function  is 


(1.2) 


and,  in  Section  3.2,  propose  a  biased  cross-validation  function 


The  (automatic)  cross-validation  smoothing  parameter  minimizes  the  sample  cross-validation 
function.  In  Figure  1,  we  plot  e  q  and  c  i  for  a  relatively  large  sample  of  10,000  standard  normal 
points  (actually  iV’(5,l)),  for  which  the  asymptotic  L2  theory  (Scott  1979)  predicts  A  =.162 
minimizes  the  mean  integrated  squared  error.  The  difference  between  these  plots  is  striking.  To 
be  sure,  most  of  the  “vertical”  noise  in  these  plots  is  due  to  a  bin  edge  effect.  This  phenomena 
was  observed  even  with  much  smaller  samples  by  Rudemo  (1982).  But  the  difference  in  noise  lev¬ 
els  has  deeper  implications.  We  claim  these  pictures  reveal  a  great  deal  about  theoretical  and 
practical  behavior  of  these  cross-validation  techniques  for  reasonable  sample  sizes  and  suggest 
differences  in  the  “horizontal”  noise  of  smoothing  parameters  obtained  by  the  two  cross-validation 
methods.  Roughly  speaking,  in  Figure  lb  we  are  seeing  the  between-sample  “vertical”  variation 
because  of  the  relatively  small  correlation  between  heights  of  adjacent  or  partially  overlapping 
bins.  An  effort  to  understand  these  plots  was  the  motivation  for  this  paper. 


2*  Asymptotic  Mean  Integrated  Squared  Error  Theory 

Consider  a  kernel  density  estimate  of  an  unknown  univariate  density  /  based  on  a  random 
sample  •  •  •  ,a:„  with  corresponding  empirical  cdf  : 

f  {y)  =  ,y  )dFn  (a: )  =  —  X)  {Xi  ,y  )  , 

n  n,._i 

indexed  by  a  smoothing  parameter  A  .  Our  goodness-of-fit  criterion  between  /  and  /  will  be  the 
usual  integrated  squared  error  {ISE ): 

00 

ISE  =  J[f  {y)-f  (y)]^dy.  (2.1) 

-00 

Let  MISE  =  E  {ISE ),  the  mean  integrated  squared  error. 
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We  shall  focus  our  attention  on  the  fixed  bandwidth  (symmetric)  kernel  estimator 

,,.2) 

Denote  by  the  “moments”  of  the  kernel  K :  K  {t)dt .  K  we  suppose 

A^i=  *  *  *  =A*p_i=0,  and  0<  |  \  <oo  for  some  even  p  ,  then  the  mean  integrated  squared  error 

may  be  written  as 

MISE  {h  )  =  AMISE  ( A  )  +  O  (n  A  2p  +1)  ^  (2.3) 

where  the  dominant  term  is  the  asymptotic  MISE  [AMISE )  given  by 

AMISE  [h  )  =  +  (p  !)-2A  (/  (p))  ,  (2.4) 

where  R[K)  is  defined  in  (1.1).  Expression  (2.3)  holds  if  we  assume  that  i?(if)<oo,  /  is 
absolutely  continuous  and  R{f  (Scott  1985),  Slightly  weaker  assumptions  on  /  would 

give  error  0  (A  ^^ ),  but  this  would  render  the  AMISE  (A )  expression  less  useful  in  practice  for 
small  samples  because  the  error  terms  may  vanish  quite  slowly.  The  AMISE  is  minimized  when 
h*  =0  for  example  with  p  =2,  by 

A'  ^{R{K)/[n^iR[r)]y/^.  (2.5) 

We  will  be  comparing  several  smoothing  parameters  and  we  adopt  the  following  easily  recalled 

notation:  for  a  particular  sample,  kj^isE  minimizes  MISE ,  A  *  minimizes  AMISE  ,  hisE  minimizes 

ISE  j  and  hucv  2ind  hscv  minimize  the  unbiased  and  biased  cross-validation  functions.  Notice 

that  the  last  three  smoothing  parameters  depend  upon  the  data. 

In  this  paper  we  focus  on  nonnegative  kernel  estimators,  which  is  the  case  p  =2.  For  our 
cross-validation  results,  conditions  on  the  kernel  and  density  will  be  slightly  stronger  than  those 
given  above.  Here  we  list  several  sets  of  conditions,  first  for  the  density  /  and  then  for  the  ker¬ 
nel: 

[C 1):  absolutely  continuous  and  /  integrable;  (/  \/7~ )  and  R  (/  \//  *^  )  finite 

[C  2a  ):  K  >0  symmetric  on  [-1,1];  Holder  continuous; 

[G2h):  /f"  absolutely  continuous  on  (-00,00);  if'”  continuous  on  (-1,1);  R  [K^^*)<oo. 
Throughout  this  paper  we  use  the  Gaussian  kernel  and  the  triweight  kernel,  which  is  defined  by 
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=  WO- 

The  triweight  kernel  is  the  simplest  kernel  satisfying  conditions  {C2a  ,b  ). 


(2.6) 


3.  Cross-Validation  Algorithms  and  Theory 


3*1*  Least- Squares  (Unbiased)  Cross-Validation 

Ideally,  for  each  sample,  we  would  like  to  construct  a  density  estimate  to  minimize  the 
integrated  squared  error  (2.1).  Such  a  strategy  would  seem  an  improvement  compared  to  a  stra¬ 
tegy  that  minimizes  mean  integrated  squared  error.  We  comment  further  on  this  point  in  Section 
7 .  The  least-squares  cross-validation  criterion  attempts  to  address  integrated  squared  error  rather 
than  mean  integrated  squared  error.  We  shall  introduce  the  least-squares  cross-validation  cri¬ 
terion  for  the  generalized  adaptive  kernel  estimator,  which  includes  most  commonly  used  estima¬ 
tors  such  as  the  histogram.  Replacing  /  by  the  generalized  estimator  g  in  (2.1)  and  expanding 
yields 


ISE  =  / j  [y  fdy  -  2jg  {y  )f  {y  )dy  +  J  f  {y  fdy  . 

Here, 


(3.1) 


1  ” 

9{y)  =  —  E  (y  ).  (3.2) 

where  the  kernel  depends  on  the  sample  size  n  and  may  also  depend  on  either  y  or  •  The  idea 
of  Rudemo  (1982)  and  Bowman  (1984)  is  to  find  data-based  expressions  that,  on  average,  agree 
with  the  first  two  terms  in  (3.1),  and  to  omit  the  third  term  i?  (/  ),  which  amounts  to  a  simple 
fixed  shift  of  the  entire  function  for  each  sample.  Consider  the  cross-validation  estimator 

UGV  =  jg  [y  fdy  -  —Y,9i  (^i )  (3.3) 

”  1=1 

where 


(3.4) 


Notice  that  the  divisor  has  been  changed  from  n  to  n  -1,  but  this  change  is  not  incorporated  into 


the  kernel.  Now  the  expectation  of  UCV  exactly  matches  the  MISE ,  which  is  the  expectation  of 
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(3.1),  term  by  term  since 

Egi  (xi )  =  EK^  (x,-  ,Xk)  =  E  [y  .x*  )/  {y)dy  =  E  Jg  {y  )f  (y  )dy  .  (3.5) 

Expression  (3.5)  requires  the  use  of  Fubini’s  theorem,  and  noting  that  the  expectations  are  of 

different  dimension. 

Hence,  in  the  fixed  bandwidth  case  (2,2),  exactly  unbiased  estimates  of  the  shifted  MISE  for 
nonrandom  h  are  provided  by 


UGV{h  )  =  Jf{yf--t  fi  K  )•  (3.6) 

”t=i 

We  refer  to  (3.6)  as  an  unbiased  cross-validation  (UCV)  criterion  because  its  expectation  is 

E^UCV{h)'^=:  MISE{h)- R{f  )  .  (3.7) 

Other  theoretical  expressions  such  as  (2.4)  are  only  asymptotically  correct;  that  is,  they  are 

biased  for  finite  samples. 

It  is  straightforward  to  see  that  (1.2)  follows  from  (3.3),  except  for  replacing  n  ±1  by  n  . 
Hall  (1983)  and  Stone  (1984)  have  shown  that  the  procedure  not  only  provides  a  consistent 
sequence  of  smoothing  parameters  but  is  asymptotically  optimal  in  a  certain  sense.  There  are  (at 
least)  two  other  remarkable  features  of  this  procedure,  namely,  its  unbiasedness  property,  and  its 
self-adapting  property.  To.  illustrate  the  second  property,  consider  the  fixed  kernel  estimator 
(2.2).  Proper  analysis  of  its  MISE  in  (2.3)  required  knowledge  of  the  “moments”  of  the  kernel, 
defined  below  expression  (2.2).  Such  specification  is  not  apparent  in  (3.6),  which  in  this  case 
becomes, 


UCV{h)  =  + 

nh 


*  1  X  ~Xi  X  —X4  9  X:  — Xv 


X  -Xy 

T 


-X4 

T 


(3.8) 


An  instructive  exercise  is  to  show  that  for  p  even: 


E 


x-x, 


X  -X,* 


=  i?(/  )  +  (p!)-2A2>’ (/(>’))  +  2  ±  (-1)*.!  A  2*  ^2*^  (/'*>)+  o  (A  +2). 

jb=p/2 


and 
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-!>{/)+  £  (-l)*75^»"(-2.fl(/'*»)  +  0(»"+»), 

jfc=p/2 

Hence, 

[?7C'V(A  )]  =  +  {p  ir^k^P  (/  («>))  -  i?  (/  )  +  o  („-i)  ,  (3.9) 

which  is  to  be  compared  to  (2.3),  (2.4),  and  (3.7).  Higher  order  terms  in  (3.9)  match  as  well  in  the 
MISE  expansion  if  the  necessary  derivatives  are  assumed  to  exist.  Thus  the  UGV  criterion 
automatically  “knows”  the  correct  order  of  the  kernel.  (In  fact,  Stone  (1984)  has  shown  that  the 
method  even  knows  how  many  derivatives  /  has  for  the  histogram  and  nonnegative  kernel  esti¬ 
mators.)  Notice  that  for  large  sample  sizes,  the  UGV  essentially  provides  estimates  of  i?  (/  f^^). 

In  view  of  Figure  1,  the  “vertical”  variability  of  UGV[h  )  is  of  interest.  Assume  now  that 
the  kernels  are  symmetric  and  have  finite  support  on  [-1,1].  K  we  define 

'^c)  =  j  K{w)K{w -\-c)dw  -2K{c)  (3.10) 

and  let  c,*j*  =  (a:,-  -ary  )/h  ,  then  (3.8)  becomes  (replacing  n  -1  by  n  ) 

The  following  theorem  provides  the  mean  and  variance  of  the  unbiased  cross-validation  function 
(3.11)  for  fixed  h  . 

Theorem  3.1.  For  the  unbiased  cross-validation  kernel  criterion  (3.11) 

E[UGV{h)]  =AMISE{h)-R{f  )  -F  0(n-^)  (3.12) 

Var  [UCV{h  )]  =  -[R  (/  3/2)  -R[ff]  +  0  [l/n^h  +  A  Vn  )  .  (3.13) 

The  variance  of  the  histogram  criterion  (1.2)  is  also  given  by  (3.13). 

We  prove  this  in  Section  10.1.  The  rate  0(n“^)  of  the  leading  term  in  (3.13)  was  noted  by 
Rudemo  (1982).  Observe  that  as  a  consequence  of  Jensen inequality  the  first  term  in  (3.13)  is 
nonnegative. 
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Remark:  For  the  example  in  Figure  1,  \/Var  =.00222  for  h  =h  *=.162.  This  noise,  indicated 
by  the  longer  vertical  line  in  the  bottom  right  comer  of  Figure  lb,  is  much  greater  than  the 
observed  noise.  We  shall  return  to  this  point  in  Section  3.3. 


3.2*  Biased  Cross-Validation 

The  asymptotic  expansion  for  the  mean  integrated  squared  error  as  given  in  (2.4)  contains 
only  one  unknown  quantity,  R  (/  (p)).  One  natural  estimator  is  R  (/  where  /  is  the  fixed 
bandwidth  kernel  estimator.  Scott,  Tapia,  and  Thompson  (1977)  used  this  estimator  in  a  fixed 
point  algorithm  for  choosing  h  in  the  case  p  =2.  However,  the  following  lemma  (proved  in  Sec¬ 
tion  10.2)  shows  that  this  estimator  is  deficient  asymptotically  and  indicates  how  an  improved 
estimator  can  be  constructed. 


Lemma  3.2;  Suppose  that  derivatives  of  order  p  +2  of  the  density  f  and  kernel  K  exist  and  are 
continuous.  Then 


E  [R  if  ('’))]  =  (/  (P))  +  +0{h^).  (3.14) 

Notice  that  for  smoothing  parameters  of  the  optimal  order  h*  =  the  kernel 

estimate  provides  a  positively  biased  estimate  of  R(/f^^),  but  by  an  asymptotically  constant 
amount.  Silverman  (1978)  based  his  visual  “test  graph”  method  for  choosing  h  on  another  char¬ 
acterization  of  this  asymptotic  bias  in  the  L  ^  norm.  An  improved  estimate  of  R{f  ^^  ^)  is 

R(/(p))^R(/(p)-A^^.  (3.15) 

Special  Case  p  —2:  For  the  important  case  of  the  nonnegative  kernel  method  when  p  =2,  let 


Then 


<j>[c)  =  JK^^{w)K^*{w -\-c  )dw  . 


(3.16) 


where  (again)  c,-y  =(a:,*-ary  )/A  .  Using  this  together  with  the  correction  (3.15)  in  the  AMISE 
expression  (2.4)  defines  a  biased  cross-validation  function: 
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Notice  that  the  two  R  (iC")/nA®  terms  cancel.  Observe  the  similarities  between  (3.11)  and  (3.17); 
both  are  C/ -statistics  but  with  different  kernels.  Expression  (3.17)  will  provide  useful  estimates  of 
the  MISE . 

Then  we  have  the  following  interesting  results  for  the  biased  cross-validation  estimator. 

Theorem  3.2:  For  a  nonnegative  kernel  estimator  satisfying  conditions  (Cl)  and  {C  2b  ),  the  esti¬ 
mator  BCV[h  )  is  asymptotically  normal  with  mean  and  variance 

E  [BGV{h  )]  =  AMISE{h  )  +  O  (n”^)  (3.18) 

Var  [BCV{h  )]  =  R  {<f>)R  (/  )  /  (8n )  +  O  {h  /n%  (3.19) 

For  the  histogram  cross-validation  estimator  given  by  (1.3), 

Var  [e  ^{h)]=R{f  )/(12n  )  +  O  (n  -^)  .  (3.20) 

For  h  =  O  (n"^/®),  Var  =  O  (n"®/^)  in  (3.19).  It  follows  from  (3.18)  and  (2.3)  that  the  bias 
in  BCV{h )  is  O  [n  ^).  Thus  the  squared  bias  is  O  (ji~^),  which  is  of  lower  order  than  the  vari¬ 
ance  by  the  factor  h  .  Hence  variance  dominates  “vertical”  mean  squared  error.  Note  that  the 
results  of  Theorem  3.1  and  Theorem  3.2  are  not  to  comparable  orders,  since  Var  =  0(n"^)  in 
(3.13).  This  discrepancy  is  resolved  in  the  next  section.  From  (3.20)  we  may  compute 
VVar  =.0000381  at  A  *  =.162,  which  closely  approximates  the  observed  variation  in  Figure  la, 
as  indicated  by  the  small  vertical  line  in  the  bottom  right  corner. 

It  follows  from  this  theorem  that  a  consistent  sequence  of  smoothing  parameters  can  be 
found. 

Corollary  3.2:  Let  hscv  rninimize  (3.17)  over  (0,6A  *)  for  any  b  >1.  Then 

plim  {Ibcv/^*)  =  1 


(3.21) 
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3,3,  Unbiased  Cross-Validation  Revisited 

3.3.1*  Augmented  Unbiased  Cross-Validation  Criterion 

The  reason  that  the  variation  computed  in  Theorem  3.1  is  not  comparable  to  that  of 
Theorem  3.2  is  that  Theorem  3.1  measures  the  vertical  variation  of  the  UCV  curve  about  the 
level  MISE —R  (/  )  rather  than  the  MISE  level,  which  is  converging  to  0,  The  vertical  variation 
of  the  entire  curve  has  no  effect  on  the  location  of  the  minimum  we  are  interested  in.  Bowman 
(1984)  method  of  derivation  gave  the  following  augmented  UCV(h)  formula,  which  Hall  (1983) 
argued  is  the  correct  form  for  theoretical  analysis: 

AUCV{h  )  =  R{f)-l±  (x, )  +  I  E  /  K-  )-R{f)  (3.22) 

With  this  change,  Theorem  3.1  is  revised  to  give  variance  results  of  the  same  order  as  that  in 
Theorem  3.2. 

Theorem  3*3:  For  nonnegative  kernel  estimators  satisfying  conditions  (G  1)  and  ((72a  ), 

Var  lA[/CV(h  )]  =  2R  {^)R  (/  )  /  (n^A  )  +  O  (A  /n^).  (3.23) 

For  the  histogram,  we  augment  (1.2)  and  find 

Var  [eo(A  )  +  -^  I]  /  (a:,- )  -  R  [f  )]  =  h^R  {fVT  )/«  +  2R  (/  )/{n%  )  +  o  {n-^/%  (3.24) 

«,*=1 

Corollary  3.3:  Let  minimize  {8,22}  or  equivalently  (8,11)  over  (aA*,6A*)  for  arbitrarily 

small  a  and  large  b  .  Then 

plim  {hucv/^*)  =  1  • 

n  -+00 

From  (3.24)  we  compute  \/Var  =.0003394  at  A  *=.162,  indicated  by  the  smaller  vertical 
line  in  the  bottom  right  comer  of  Figure  lb.  This  closely  approximates  the  observed  variation. 
However,  this  standard  deviation  is  8.9  times  larger  than  for  the  biased  criterion  in  Figure  la. 
Conditions  (Cl)  are  much  stronger  than  those  required  by  Hall  and  Marron  (1985)  due  to  our 
different  approach. 
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3.3.2.  Asymptotic  Relative  ' '  VerticaK  '  Variability 

It  is  now  a  simple  matter  to  compare  Theorems  3.2  and  3.3  for  kernel  estimates.  The  rela¬ 
tive  variability  may  be  defined  by  the  square  root  of  the  ratio  of  the  variances  (3.23)  and  (3.19): 

11/2 


ratio  — 


/^2 


(3.25) 


which  only  depends  on  the  kernel.  This  ratio  exceeds  10  for  most  kernels.  In  the  first  part  of 
Table  I,  we  have  computed  this  ratio  for  several  practical  kernels. 

The  extent  to  which  the  “vertical”  noise  is  converted  to  “horizontal”  noise  is  examined 
empirically  in  Section  6  and  theoretically  in  Section  4. 


4.  Variability  and  Asymptotic  Normality  of  CV  Smoothing  Parameters 

In  Section  3  we  characterized  the  “vertical”  variability  of  the  cross-validation  functions.  Of 
more  practical  interest  is  the  “horizontal”  variability  of  the  actual  cross-validation  estimates 
huQY  and  h^Qy .  In  terms  of  ratios,  we  show  that  about  half  of  the  “vertical”  error  is  translated 
into  “horizontal”  error. 


4.1.  Unbiased  Cross-Validation 

Hall  and  Marron  (1985)  investigate  the  variability  of  hucv  about  the  idealized  target  hjsE 
and  show  that  hucv-^iSE  is  asymptotically  normal  {AN),  Because  (as  we  will  see  in  Sections  6 
and  7)  hucv  ^.nd  hjsE  are  often  negatively  correlated,  we  now  compute  the  variation  of  hucv  >  or 
equivalently,  of  hucv'-^  *  •  We  examine  the  first  derivative  of  UCV{h )  given  in  (3.11),  since  the 
extra  terms  in  the  augmented  criterion  (3.22)  do  not  involve  h  .  Let  7_,.(c  )  and  '7_(c  )  define  ^(c  ) 
given  in  (3.10)  on  the  intervals  [0,2]  and  [-2,0],  respectively: 

l-c 

7^(c  )  =  j  K{w)K{w -\-c)dw  -2K{c)  0<c  <2 

-1 

and  7_(c  )  as  in  (4.1)  with  limits  -l-c  and  1.  Consider  the  derivative  of  7+(c,y  ): 


(4,1) 
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-^'7+(c.7)  =  /  K{w)K'{w+Cij)dw  +  0<c,-y  <2  , 

where  the  other  term  involving  the  derivative  of  the  upper  endpoint  in  the  integral  vanishes  since 
if '(l)=0.  K  we  define 


p{c)  ==  c  JK{w  )K*{w  H-c  )dw  -  2cif '(c  )  -2<c  <2 

and  zero  elsewhere,  then  hucy  satisfies 


or,  equivalently, 


■^UCV{k) 


UCV 


=  0 


(4.2) 


EEW<=.7)  +  P(<=.7)] 


*  <y 


h^hn 


=  -nR  {K)/2  . 


(4.3) 


Hall  and  Marron  have  shown  that  the  left-hand  side  (which  is  a  degenerate  martingale)  is  AN .  In 
Section  10.5  we  compute  the  moments  and  find 


Lemma  4.1:  Under  conditions  (Cl)  and  (C2a), 

EEWc.7)  +  p(c.;)]  =  AN{-n^h^ixiR  {f")/2  ,  n^hR(p)R  (/  )/2}  (4  4) 

Now  plim  [hjjcY /h* )  =  1  so  that  we  may  replace  hucv  by  A  *  in  the  variance.  Hence  (4.3) 

n  —*oo  '  ' 

becomes 

-n  %cvfJtiR  (/")/2  =  AN  {-fiR  {K  )/2,n%'R  {p)R  (/  )/2}  .  (4.5) 

Dividing  we  have 

lu^cv  =  AN  {R{K)/  [n  p^R  (/")]  ,  2h‘ R{p)R  (/  )  /  [n  if''?]}  •  (4-6) 

But  the  mean  is  simply  (A  *  )^  by  (2.5).  Hence 

(hucv /h^f  =  AN  {I,  2R  {p)R  (/  )  /  [n\h-fp^^R  (/»)^}  •  (4.7) 

Since  the  variance  —►0  as  n  — >oo,  we  may  apply  the  delta  method  (Serfiing  1980)  with  g  (x 

which  reduces  the  variance  by  the  factor  25.  Multiplying  through  by  A  * ,  we  have 


Theorem  4.1:  For  a  nonnegative  kernel  estimator  satisfying  conditions  (Cl)  and  (C2a)^ 


hucv  =AN{h\2R  {p)R  (/  )  /  [25n2A  'Wi?  (/")"]}  • 
Now  set  h*  =  ;  then  the  standard  deviation  is  given  by 


(4.8) 
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\/2c  _ 

<T(At/CK-A  *)  =  (/  ) 


^  ^  V 

The  relative  error  of  hxjcv  O 


(4.9) 


4*2*  Biased  Cross-Validation 


In  a  similar  fashion  we  may  investigate  the  limiting  distribution  of  Kbcv-^  *  • 


Define 


'il^[c)=c  j  K*\w)K^^^{w -\~c)dw  -2<c  <2  (4.10) 

and  zero  elsewhere.  Then  taking  the  derivative  of  (3.17),  we  find 

EE['A(c.7  )  +  V'(c,7 )] 

i<i 

In  Section  10.6  we  compute  the  first  two  asymptotic  moments  [AM)  of  (4.11)  and  obtain 


h=h„ 


=  -2nR  [K]  /  ni 


(4.11) 


Lemma  4.2:  Under  conditions  (Cl)  and  (C2a,b), 

EEW<:.7)  +  V>(c.-y  )]=  AM  {-2n^h^R  {/")  ,  nHR  [^)R  (/  )/2}  (4  12) 

Again  plim  {hscv *)  =  1,  so  that  we  may  use  A  *  in  the  variance.  In  a  direct  fashion  we  find 

n  — ►©© 

hB%v  -  AM{{h*f  ,h* R{mU  )  /  [^n-^R{rf\}  .  (4.13) 

Applying  the  delta  method  as  before, 


Theorem  4*2:  Under  the  conditions  of  Lemma  4-^ 

hcv  =AM{h*  ,R{m{f)/[^00n%h*fR{f^*f]}  (4.14) 

and 

r.  -V2 

<t(W-a  *)  =  ■  (4-i5) 

We  remark  that  we  believe  it  can  be  shown  that  A^cv  is  AN  in  (4.14). 


4.3.  Asymptotic  Relative  “Horizontal”  Variability 

Comparing  the  standard  deviations  in  (4.9)  and  (4.15),  we  see  that  the  asymptotic  relative 
“horizontal”  efficiency  defined  as  the  ratio  of  these  standard  deviations  (not  variances)  is 
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ratio 


=  -L  /Z 
/il  V  R 


w 


H  V  R[^)- 

Compare  this  to  the  ratio  for  the  “vertical”  noise  given  in  expression  (3.25).  For  the  triweight 
kernel  this  ratio  is  4.98;  see  Table  I.  The  usefulness  of  these  results  in  practice  is  discussed  in  the 
remainder  of  the  paper. 


5.  Implementation  with  Gaussian  Kernel  and  Averaged  Shifted  Histogram  Estima¬ 
tors 


5*1.  Two  Introductory  Examples 


The  development  thus  far  requires  kernels  with  finite  support.  However,  it  extends  to  ker¬ 
nels  with  exponentially  decreasing  tails  as  is  the  case  with  the  Gaussian  kernel.  For  this  impor¬ 
tant  case,  which  was  considered  by  Rudemo  (1982)  and  Bowman  (1984),  equations  (3.11)  and 
(3.17)  become 


UCV{h)  = 


+ 


2\/nnh  \fKn'^h 


(5.1) 


and 


(52) 


respectively.  (Equation  (5.1)  actually  replaces  terms  like  n  -fl  with  n  .)  We  plot  (5.1)  and  (5.2)  in 
Figures  2a  and  2b  for  samples  of  size  25  and  400  from  AT  (0,1),  using  data  generated  by  IMSL  rou¬ 
tine  GGNPM  with  seeds  1821291829  and  1943248741,  respectively.  For  plotting  purposes,  we 
have  augmented  UCV[h  )  as  in  equation  (3.22).  The  dotted  line  represents  the  (exact)  MISE  as  a 
function  of  h  (Fryer  1976),  which  is  given  by 


MISE{h,n)  = 


n  -1 


2\/7rnh  ’  2n  n/tt^I+P)  >/27r(2+P)  2\/7r  ' 

For  fixed  n  the  BCV  function  converges  to  0  as  A  — ►oo.  The  BCV  function  barely  exhibits 

a  local  minimum  with  n  =  25  (sometimes  it  has  none;  see  Section  6),  but  exhibits  a  clear  local 
minimum  when  n  —  400.  Heuristically,  the  BCV  indicates  the  quality  of  Iibcv  by  the  amount  of 
rise  to  the  right  of  the  minimum.  As  n  increases,  BCV  provides  reasonable  estimates  of  MISE 
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for  relatively  larger  values  of  h  beyond  .  Recall  that  MISE  is  increasingly  dominated  by 
bias  in  the  region  to  the  right  of  h$^isE  • 

The  UGV  function  does  relatively  well  in  the  high  bias  region  and  less  well  in  the  high  vari¬ 
ance  region,  which  is  to  the  left  of  h^^isB  i  ^  predicted  by  Theorem  3.3.  There  is  no  high  fre¬ 
quency  component  evident  in  individual  plots  as  was  the  case  for  the  histogram  in  Figme  la  since 
we  are  using  a  continuous  kernel.  With  n  =  400  we  have  selected  a  case  where  the  UGV  func¬ 
tion  has  a  minimum  well  to  the  left  of  ht^jsE  \  see  Section  6.1.  (The  minima  in  Figure  2b  are 
0.142,  0.330,  and  0.389).  Rudemo  in  a  draft  of  his  1982  paper  observed  this  (occasional)  behavior 
for  smaller  samples  and  speculated  it  was  consistent  with  features  in  the  data.  In  Figure  2c  we 
plot  the  two  GV  estimates  along  with  the  true  density  (the  h^js£  estimate  is  quite  similar  to  the 
BGV  estimate).  The  density  estimate  reveals  the  illusory  multimodal  feature  that  attracted  the 
UGV  function.  In  Figure  2b,  observe  the  inflection  points  in  both  GV  curves  in  the  neighbor¬ 
hood  of  the  other's  minimum.  The  UGV  (5.1)  also  eventually  converges  to  0  (the  augmented  ver¬ 
sion  to  approximately  R  (/  ));  the  curve  in  Figure  2b  increases  monotonically  to  -.579  on  the  log 
scale. 

For  n  =400,  evaluating  (5.1)  and  (5,2)  for  each  h  took  more  than  1.1  CPU  minutes  on  a 
VAX  11/750.  Figure  2b  required  several  hours  of  CPU.  Clearly  an  alternative  implementation  is 
required  for  even  moderate  sample  sizes. 

5,2.  Averaged  Shifted  Histogram  Implementation 

In  order  to  carry  out  an  extensive  Monte  Carlo  study,  it  is  necessary  to  find  a  more  compu¬ 
tationally  feasible  method  than  the  very  slow  Gaussian  kernel  implementation  given  above.  Much 
faster  evaluations  of  (3.11)  and  (3.17)  are  possible  with  finite  support  kernels.  Furthermore,  a  ker¬ 
nel  procedure  using  binned  data  accelerates  GV  algorithms  even  more,  for  example,  Silverman's 
(1982)  Fast  Fourier  Transform  algorithm.  Another  procedure  that  takes  advantage  of  binned 
data  is  the  averaged  shifted  histogram  (ASH)  (Scott  1985).  An  ASH  is  the  (weighted)  average  of 
m  histograms,  each  with  bin  width  h  but  with  bin  mesh  origins  at  integer  multiples  of 
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6  =  h  /m  j  and  is  given  by; 

^  ITI  —  1 

f m{y)  =  «’m(»W*+‘)  fory  in  /*  ,  (5.3) 

where  (» )  are  the  weights  and  u^k  )  is  the  bin  count  for  the  k  -th  bin  /*  =  [A;  8,{k  +1)6).  The 
weights  corresponding  to  the  triweight  kernel  (2.6)  are 

W'm  (i )  =  Cm  [l-(*  /mf\^  for  I  J  I  <  m  ,  (5.4) 

where  c„  is  a  normalizing  constant  so  that  (1 )  =  m  given  by 


c„  =  35  /  [32(1  -  l/4m2)(l  +  l/4m2  +  5/24m^)]  . 

The  UCV  formula  (3.3)  for  /„  is  easily  evaluated.  The  term  ^ f mivT^V  is  computed 

directly.  The  term  (/m  V.  )  in  (3.3)  and  (3.4)  is  simply  equal  to 


u^k )  5% 


(0) 

h 


(5.5) 


where  =/m(^5)is  the  value  of  / „  A  •  practice  the  sum  in  (5.5)  involves  perhaps  a  few 
hundred  terms.  For  m  >10  (i.e.  <5  sufficiently  small)  the  behavior  of  the  kernel  and  ASH  estima¬ 
tors  is  virtually  identical;  in  particular,  similar  values  of  the  smoothing  parameter  h  give  nearly 
identical  results. 


For  BCV j  the  asymptotic  theory  for  the  ASH  involves  both  R  (/')  and  R  (/"),  which  is 
unfortunate.  However,  the  frequency  polygon  (linear  interpolator)  of  the  ASH  (FP-ASH)  requires 
only  R  (/”)*  We  cannot  use  binned  data  with  UCV  on  FP-ASH  since  we  would  need  to  know 
(a:,* )  -  i.e.  need  to  know  all  the  a:,-  exactly  and  not  just  (or  equivalently,  the  bin  counts)  as  in 
ASH  case.  Again  we  emphasize  that  for  m  >10,  the  ordinary  kernel,  ASH,  and  FP-ASH  are 
essentially  the  same  for  the  same  h  . 


The  asymptotic  MISE  expression  for  the  FP-ASH  is  (Scott  1985) 


AMISE  = 


2Ry,  + 

Znh 


-h 


2m" 


4- 


49 


720m' 


R{f") 


where  mR^  =  XI (*  7to  =  XI (*  (^  )•  Our  estimate  of 


i?  (/'')  turns  out  to  be: 
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m  -1 


2wJ^i  +  4[w^w  i)^  +  2  ^  {wi+i-2wi  +Wi^if 


1=1 


(5.6) 


where  we  denote  «;„  (/ )  by  and  5*  is  defined  below  equation  (5.5).  These  may  be  computed  in 
closed  form  using  MACSYMA  (the  triweight  kernel  formulae  are  available  from  us). 


6.  Monte  Carlo  Study 

6.1.  Small-to-Large  Sample  Behavior  with  Gaussian  Data 

In  this  section  we  study  the  results  of  simulations  based  on  samples  from  a  standard  normal 
distribution  for  sample  sizes  n  =  25,  100,  400,  1600,  6400,  and  25600  with  repetitions  of  250,  200, 
150,  100,  100,  and  100,  respectively,  ASH  and  FP-ASH  estimators  with  a  triweight  kernel  were 
used  as  described  above.  The  5’s  chosen  were  .15,  .10,  .05,  .025,  .02,  and  .01,  respectively.  For 
each  sample,  ISE ’s  corresponding  to  4  different  bandwidths  h  (or  equivalently  m  since  k  =  m  6) 
were  computed  numerically:  hj^is£  ,  h^sE  t  ^bcv  ?  ^md  ~hucv  •  The  value  Iij^e  ,  which  minimizes  the 
ISE  for  a  particular  sample,  was  found  by  searching  over  integer  values  m  . 

In  Figure  3  we  plot  frequency  polygons  of  the  cross-validated  smoothing  parameters.  The 
vertical  lines  indicate  the  set  of  k ’s  examined  (multiples  of  S).  In  Table  11  we  present  some  sum¬ 
mary  statistics.  We  note  immediately  that  in  103/250  samples  with  n  =25,  the  BCV  function 
had  no  local  minima  (compare  Figure  2a).  The  average  of  hucv  when  n  =25  is  reasonable,  but 
only  a  relatively  few  individual  samples  are  close  to  hj^j^E  •  (Of  course,  perhaps  hfs£  isn’t  close. 
We  check  this  below.)  We  have  not  found  any  samples  where  the  BCV  failed  to  have  a  local 
minimum  for  n  >40.  (For  other  densities  this  threshold  is  higher.)  On  the  other  hand,  the  biased 
CV  estimates  tighten  up  quickly  beyond  this  threshold  so  that  the  “worst”  case  for  n  >1600  is 
quite  close  to  (a  reasonable  target  as  we  discuss  in  Section  7).  The  unbiased  CV  procedure 

continues  to  be  attracted  to  spurious  (rough)  estimates  even  with  n  =25600.  Its  convergence  to 
normality  is  also  apparently  slower.  The  asymptotic  theory  predicts  a  ratio  of  “vertical”  stan¬ 
dard  errors  of  the  CV  curves  of  11.65  (which  was  observed  in  the  simulations)  and  a  ratio  of 
“horizontal”  standard  errors  of  CV  smoothing  parameters  of  4.98;  see  Table  I.  In  Table  II  we 
see  that  the  finite  sample  ratio  is  reasonably  close  to  4.98  for  moderate  sample  sizes  and  that 
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expressions  (4.15)  and  (4.9),  which  yield  o-(Abcv)=  .250n~®/^°  and  a(hucv)  =  1.243n“®/^°,  are 
remarkably  accurate. 

A  more  detailed  study  of  the  individual  results  for  n  =400  and  n  =25600  is  worthwhile.  In 
Figure  4  we  compare  the  various  smoothing  parameters.  Surprisingly,  there  is  a  negative  correla¬ 
tion  between  hjsE  ^ucv  (—-41  and  —.38,  respectively);  see  Section  7.  The  ^bcv  clusters  more 
tightly  around  correlations  with  hjsE  s-re  -.44  and  -.16,  respectively).  For  the  150 

repetitions  with  n=400,  41  had  hucv  <-85  (.85  was  the  smallest  observed  Iibcv  value).  In  23  of 
150  samples  the  UCV  curve  had  2  minima,  always  one  less  and  one  greater  than  .85.  Seven  of 
these  had  a  more  reasonable  local  minimum  near  h  * .  Sixteen  (all  <.85)  were  local  minima  com¬ 
pared  to  a  reasonable  lii;cv  i^car  h  .  When  n=25600,  only  2  of  100  UCV  curves  had  a  second 
(local)  minimum,  but  in  both  cases  the  global  minimizer  was  more  reasonable.  None  of  the  BCV 
curves  had  any  other  local  minima  over  the  range  searched. 

In  Figure  5  the  numerically  computed  ISE  *s  of  these  samples  are  displayed.  From  Figures 
5a  and  5d,  is  only  occasionally  grossly  inefficient  relative  to  .  In  Figures  5b  and  5e  we 

see  that  the  BCV  almost  dominates  the  UCV  estimates  with  respect  to  ISE !  We  try  to  under¬ 
stand  the  BCV  estimates  in  Section  7.  Figures  5c  and  5f  are  presented  for  completeness. 

Using  the  Hall  and  Marron  (1985)  formulae  for  the  triweight  kernel  and  Gaussian  data,  we 
obtain  (T{hrsE)  =  1.304n-^/‘°  and  cr(hucv -hiss)  =  2.08171“®/^°.  Since  (tQiucy)  =  1.243n-®/^°,  it 
follows  that  there  is  indeed  a  negative  correlation  between  hucv  ^.nd  hjsE  •  With  the  data  above, 
we  computed  the  sample  version  of  (^(hucv -^ise)  as  .3464  and  .0918  for  n  =400  and  n  =25600, 
respectively,  which  agree  closely  with  the  theoretical  predictions  of  .3448  and  .0990.  Thus  while 
the  variability  of  hucv  and  hjsE  is  similar,  the  negative  correlation  suggests  they  are  often  on 
opposite  sides  of  h^^jsE  •  We  have  seen  how  Hbcv  ?  which  is  very  close  to  h^^j^E  for  large  samples, 
generally  corresponds  to  estimates  with  integrated  squared  errors  smaller  than  using  hu^y . 
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6.2*  Other  Densities 

Similar  simulations  were  performed  for  three  other  densities:  Cauchy,  Lognormal  (exponen¬ 
tial  of  standard  Gaussian  random  variable),  and  a  mixture  given  by 

/(z)  =  .75  <j>{x-0,l)  + .2f> 

where  </>{x  is  the  normal  density  with  mean  fi  and  variance  Each  of  these  densities  can 

be  adequately  estimated  by  a  fixed  bandwidth  estimator,  but  an  appropriate  variable  bandwidth 
estimator  would  be  useful  for  small  samples.  The  efficiency  of  a  fixed  bandwidth  nonnegative  ker¬ 
nel  estimator  relative  to  a  variable  bandwidth  nonnegative  kernel  estimator  with  h  =  hj  (that  is, 
a  different  h  for  every  point  estimate  /  {x  ))  may  be  seen  to  be 


AMISE {adaptive  )  / 

fixff  {x^ 

1/5 

dx 

AMISE  ( /  ixed ) 

Jf'ixfdxj 

[75 — 

This  ratio  is  equal  to  .915,  .767,  .640,  and  .728,  for  the  Gaussian,  Cauchy,  Lognormal,  and  Mix¬ 
ture  densities,  respectively. 

In  Figure  6  we  plot  histograms  of  the  CV  estimates  for  100  repetitions  with  n  =1600.  The 
Cauchy  simulations  (25 <n  <25600)  were  similar  to  the  Gaussian  results  in  Section  6.1  except 
that  17%  of  the  BCV  estimates  failed  to  exist  for  n  =100  and  the  ratios  of  “horizontal”  standard 
errors  increased  to  only  3.0;  see  Table  HI. 

The  lognormal  and  mixture  results  have  an  interesting  twist.  Notice  the  BOV  estimates  are 
shifted  to  the  right  from  the  UCV  estimates.  BCV  failures  were  observed  at  n  =400.  The 
UGV  continued  to  perform  as  usual;  average  behavior  close  to  with  high  variability.  The 

BCV  was  definitely  biased  upward  for  moderate  sample  sizes  (not  including  samples  where  BCV 
did  not  exist),  although  the  bias  vanishes  by  n  =25600.  We  understand  this  phenomenon  as  fol¬ 
lows:  the  estimates  which  are  optimal  with  respect  to  ISE  are  relatively  rough  or  noisy.  This  is 
not  a  defect  of  L  2  error  but  of  the  fixed  bandwidth  estimator;  see  Section  7. 

We  examine  two  particular  examples.  In  Figure  7,  density  estimates  of  a  lognormal  sample 
with  n=1600  are  shown.  We  used  ^=.015  and  cross-validated  over  the  interval  (-1,14).  With 
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this  sample  A  =.21  (m  =14)  for  the  UGV ,  ISE  j  and  MISE  criteria  while  A^0y-=.33  (m  =22). 
The  minimum  ISE  estimate  in  Figure  7b  is  quite  rough,  even  near  the  mode.  The  BCV  estimate 
in  Figure  7a  smoothes  more  appropriately  near  the  mode  and  has  somewhat  reduced  noise  in  the 
tail  (its  ISE  is  34%  larger).  It  is  interesting  to  note  that  90%  of  R  (/^^)  comes  from  the  interval 
(0,.0257)!  (We  remark  that  an  adaptive  estimator  would  not  be  much  improved  near  the  mode 
but  rather  in  the  tail.)  Li  and  errors  are  minimized  for  A  =.225  and  .165,  respectively. 

Figure  8  is  a  plot  of  kernel  estimates  of  a  mixture  sample  with  n  =400.  Again  5=.015.  For 
this  sample,  —  -615,  hjs£  =  .510,  hucv  =  *480,  and  h^cv  =  *870  (with  ISE  greater  by 

55%).  Li  and  errors  are  minimized  for  A  =.525  and  .540,  (m  =35  and  36),  respectively.  The 
UCV  estimator  is  best  in  the  narrow  peak  while  the  BCV  is  better  in  the  larger  peak.  For  a 
fixed  bandwidth  estimator,  we  prefer  the  estimate  in  Figure  8a  (consistent  with  earlier  recommen¬ 
dations  of  Fryer  (1976)  to  slightly  exceed  A  ).  On  the  other  hand  for  small  samples,  occasionally 
large  h^cy ’s  are  produced  that  obscure  the  bimodal  feature. 

7,  Some  Issues  in  Cross-Validation 

One  issue  that  reoccurs  is  whether  we  should  use  with  a  smoothing  parameter  that  minim¬ 
izes  MISE  (or  AMISE )  or  whether  we  should  minimize  the  ISE  for  the  data  at  hand.  In  theory, 
we  should  address  ISE .  Two  factors  primarily  affect  the  (optimal)  ISE  of  a  density  estimate  for 
individual  samples.  The  first  is  our  use  of  a  fixed  bandwidth  estimator  when  a  variable 
bandwidth  estimator  is  more  appropriate;  see  equation  (6.1).  We  do  not  address  this  subject  in 
any  more  detail,  except  to  note  that  its  cross-validation  is  more  delicate  due  to  an  increase  in 
number  of  smoothing  parameters.  The  second  factor  is  variation  in  the  lower  order  sample 
moments.  While  no  choice  of  smoothing  parameter  can  compensate  for  a  shift  in  mean,  it  is  pos¬ 
sible  to  reduce  ISE  due  to  variation  in  the  sample  variance.  If  d’<cr,  then  choose  h  >h*  and 
vice  versa.  In  practice  we  cannot  expect  a  cross-validation  method  to  successfully  mimic  the 
behavior  of  .  To  do  so  would  require  guessing  whether  a>(7  or  vice  versa.  Clearly  this 
requires  knowledge  about  the  unknown  density.  Thus  we  do  not  expect  much,  if  any,  improve¬ 
ment  for  CV  methods  attempting  to  minimize  ISE  compared  to  those  seeking  the  single  fixed 
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bandwidth  minimizing  MISE  as  we  saw  in  Section  6.  Bowman  (1985)  found  that  using  a  simple 
rule  such  as  h  =  an  worked  (embarrassingly)  well  with  respect  to  ISE  for  a  Gaussian  kernel 
estimator.  Such  a  rule  is  close  to  a  MISE  rule  for  many  unimodal  densities. 

It  is  easy  to  demonstrate  these  observations  by  simulation.  We  chose  standard  Gaussian 
data  since  fixed  bandwidth  estimates  are  91.5%  efficient.  We  used  the  moderate  and  large  sample 
sizes  of  400  and  25,600  with  150  and  100  repetitions,  respectively,  of  Section  6.  In  Figure  9  we 
plot  hjsE  vs  a  for  sample  sizes  400  and  25,600.  The  correlations  between  hjs£  and  a  were  -.632 
and  -.688,  respectively.  When  the  same  samples  were  shifted  to  have  zero  sample  mean,  these 
correlations  changed  little.  The  sample  mean  had  a  much  smaller  effect.  Thus  we  see  that  any 
benefits  to  be  gained  from  minimizing  ISE  rather  than  MISE  are  swamped  by  the  much  larger 
asymptotic  error  of  the  algorithms  which  pursue  the  former  goal. 

We  also  find  that  GV  performance  depends  strongly  on  sample  size  and  the  underlying  den¬ 
sity.  Specifically,  the  conditional  probability  that  the  CV  smoothing  parameter  is  “acceptable” 
given  n  increases  rather  rapidly  from  10%  to  90%;  however,  the  location  of  this  transition  region 
may  begin  with  surprisingly  large  sample  sizes.  Further  work  characterizing  this  transition  would 
be  interesting.  With  finite  samples  we  are  limited  in  our  ability  to  adequately  estimate  all  densi¬ 
ties.  Such  fears  are  justified  but  clearly  we  are  in  a  stronger  position  than  if  we  made  a 
parametric  choice.  As  in  spectral  analysis,  we  have  a  bandwidth  that  indicates  an  upper  bound 
on  the  size  of  a  feature  that  may  be  hidden. 

Another  point  of  question  is  whether  to  use  L  2  error  or  L 1  error,  which  is  defined  by 
£*  J  I  /  -/  I  .  For  most  cases,  this  choice  makes  little  difference  between  the  optimal  estimates 
for  a  particular  density  estimator;  that  is,  the  prescription  and  competing  (possible)  density  esti¬ 
mates  (for  alternative  h ’s)  are  the  same  for  either  error  criterion.  We  postpone  discussion  of  the 
question  of  prior  assumptions  on  /  until  Section  9. 
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8.  Some  More  Examples 

Good  and  Gaskins  (1980)  present  a  large  particle  physics  data  set  (the  LRL  data),  which  is 
interesting  because  it  is  prebinned  with  8=1Q  McV .  The  authors  found  thirteen  bumps  in  a 
penalized  likelihood  estimate.  The  optimal  bin  width  using  either  histogram  criteria  (1.2)  or  (1.3) 
gives  h  —10  MtV  as  optimal. 

We  also  examined  these  data  with  a  triweight  ASH  estimator.  In  this  case  m  =2  for  UCV 
and  m  —4  for  BOV  using  the  ASH  and  FP-ASH,  respectively.  The  square  roots  of  these  esti¬ 
mates  are  shown  in  Figure  10.  AJt hough  the  13  bumps  found  by  Good  and  Gaskins  are  apparent 
in  Figure  10c,  it  is  interesting  to  speculate  why  certain  small  bumps  are  included  and  others 
excluded.  It  is  appropriate  to  recall  that  an  optimally  smoothed  density  has  a  slightly  noisy 
second  derivative,  as  shown  in  equation  (3.15)  when  p  ==2. 

We  have  implemented  a  bivariate  product  kernel  BCV  algorithm.  Details  and  an  example 
with  a  data  set  (thought  to  have  a  bimodal  density)  of  320  males  with  heart  disease  are  available 
from  us.  The  bimodal  feature  was  not  revealed  by  a  BCV  estimate,  in  the  same  manner 
observed  in  the  univariate  mixture  example  in  Section  6.2. 

9*  Discussion,  Conclusions,  and  Other  Work 
9.1.  Other  Related  Work 

Kronmal  and  Tarter  (1968)  introduced  the  first  unbiased  CV  algorithm  for  a  Fourier  series 
density  estimator.  The  algorithm  provided  unbiased  estimates  of  the  change  in  MISE  as  addi¬ 
tional  Fourier  coefficients  were  introduced.  Hart  (1985)  has  given  an  unbiased  procedure  for 
Davis’s  Fourier  integral  series  estimator.  Wahba  (1977)  had  the  first  working  BCV  algorithm, 
which  she  called  generalized  CV .  In  her  Fourier  series  estimator,  the  smoothing  parameter  is  not 
the  number  of  terms  in  the  series  (taken  to  be  n  /2)  but  a  design  parameter  in  a  tapering  window 
applied  to  the  Fourier  coefficients.  The  leading  terms  in  the  theoretical  MISE  depend  only  on 
the  magnitude  of  the  Fourier  coefficients  |  /  |  ^.  By  substituting  an  unbiased  sample  estimator 

for  I  /  I  she  derived  a  biased  cross-validation  criterion.  The  (small)  bias  results  from 
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truncation  of  the  series  in  and  Wahba  (1981)  showed  the  truncated  terms  contributed  only 
O  (n  )  towards  the  MISE .  Wahba *s  and  KuUback-Liebler  methods  were  tested  by  Scott  and 
Factor  (1981).  Our  biased  CV  algorithm  is  essentially  the  analog  of  Wahba's  procedure. 

9.2*  Partial  Explanation  for  Improvement 

The  improvement  of  BCV  over  UOV  should  not  be  viewed  as  artificial.  The  nature  of  the 
improvement  is  most  easily  seen  with  the  histogram,  for  which, 

AMISE{h)=jj-  +  j^h^R{f').  (9.1) 

Recall  that  h  =  O  (n  ^^^)  for  the  histogram.  Following  (1.2)  and  (1.3),  consider  a  third  estimate 
of  (9.1): 

OQ  1 

Now  e^{h)  is  based  on  a  central  difference  approximation  to  R  (/'),  which  is  numerically  superior 
to  the  forward  difference  approximation  leading  to  ei(A).  It  may  be  shown  that  the  “vertical” 
variances  of  cq,  e  i,  and  eg  are  2R  {f  )/n^h  ^0  [h^/n  ),  R{f)/12n^h,  and  R  {f  )/192n^h  , 
respectively.  Again  the  squared  bias  is  of  lower  order  0  (n"^).  This  is  a  remarkable  decrease  in 
the  variances.  But  for  finite  samples,  the  use  of  higher  order  derivative  approximations  will  incur 
large  bias  and  hence  the  gains  are  not  realized  except  for  extremely  large  samples.  This  is  similar 
to  the  choice  of  p  in  equation  (2.4).  Theory  suggests  choosing  p  as  large  as  possible,  whereas  in 
practice  p  =2  or  3  is  a  wiser  choice.  The  higher  order  terms  cannot  in  general  be  neglected.  But 
for  moderate  samples  e  i(A )  does  represent  a  substantial  improvement  over  e  o(/i ),  whereas  e  2(A  ) 
may  not. 

9.3.  D  iscussion 

We  have  attempted  to  evaluate  the  small-sample  properties  and  reliability  of  two  cross- 
validation  algorithms.  No  currently  available  algorithm  is  highly  reliable  for  very  small  samples. 
In  this  situation  BCV  always  oversmoothes  while  UCV  has  very  large  variance.  However,  for 
“large”  samples  cross-validation  is  highly  reliable  with  respect  to  MISE ,  Reliability  with 


“medium”  samples  is  often  achieved  with  densities  that  are  not  too  rough.  From  Tables  II  and 
in  we  see  that  our  definition  of  a  highly  reliable  OV  algorithm  is  satisfied  by  the  BCV  estimates 
for  sample  sizes  beyond  500-1000  except  for  the  lognormal  density,  which  requires  several 
thousand  points.  The  goal  of  finding  h[SE  with  CV  algorithms  remains  largely  unsolved,  as 
pointed  out  in  Section  6.1.  It  is  not  at  all  clear  that  using  h[SE  be  preferred  to  ,  given 
the  rather  peculiar  manner  by  which  the  integrated  squared  error  is  further  reduced,  as  discussed 
in  Section  7. 

While  biased  GV  has  potentially  greater  reliability  compared  to  unbiased  CV  ^  it  comes  at 
the  cost  of  additional  assumptions  on  /  .  However,  the  very  general  optimal  consistency  of 
unbiased  CV  comes  at  a  surprisingly  high  cost  in  sample  size  requirements  if  /  is  smooth. 
Asymptotically,  about  (4.98)^°/®  or  about  211  times  more  points  are  required  so  that  (^(hucv) 
equals  (^{hscv)  for  triweight  kernel.  Thus  we  have  a  tension  between  “customized”  and  “gen¬ 
eric”  CV ,  It  would  be  interesting  to  investigate  how  much  unbiased  CV  can  be  improved 
perhaps,  for  example,  by  leaving  more  than  one  point  out. 

Perhaps  most  useful  is  to  observe  the  divergence  in  behavior  of  UCV  and  BCV  algorithms. 
Agreement  or  disagreement  of  the  2  GV  parameters  provides  possible  auxiliary  information  about 
any  unusual  features  in  the  underlying  density.  Biased  CV  is  essentially  using  the  data  to  esti¬ 
mate  the  bias.  This  is  (and  should  be)  a  difficult  task  because  the  relative  contribution  of  the  bias 
and  variance  towards  the  MISE  is  in  a  ratio  of  1:4  near  optimal  smoothing.  Unbiased  CV  pro¬ 
vides  superior  bias  estimates  but  at  the  cost  of  increased  variance.  Given  the  importance  of  vari¬ 
ance  at  h  —Am/se  j  it  is  important  to  control  “vertical”  variance  more  than  current  UCV  algo¬ 
rithms  do.  Simple  local  averaging  of  the  UCV  curve  is  not  the  solution  as  one  might  have 
guessed  from  Figure  1. 

We  observe  that  the  BCV  procedure  may  be  used  to  obtain  approximate  confidence  inter¬ 
vals  for  both  hucY  and  h^cv  ?  assuming  the  latter  is  asymptotically  normal.  BCV  provides  con¬ 
sistent  estimates  of  i?(/")  as  well  as  R{f  ),  which  may  be  used  in  (4.9)  and  (4.15).  In  fact, 
Theorem  3.2  follows  from  the  fact  that  R  (/")  =  AN  {R  (/"),2i?  (<^)i?  (/  )/(n^A®)}.  Some  idea 
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of  the  reliability  of  the  CV  smoothing  parameter  can  be  drawn  from  these  estimates. 

For  sufficiently  large  data  sets  and  reasonable  densities,  reliability  is  achievable.  We  wish  to 
emphasize  that  excellent  density  estimates  are  still  possible  with  smaller  samples,  but  cannot  be 
reliably  calibrated  by  present  methodology.  We  believe  superior  unbiased  and  biased  CV  kernel 
estimators  can  be  found,  since  neither  development  attempted  to  optimize  reliability.  Perhaps 
the  more  computationally  intensive  bootstrap  methods  can  be  used  to  improve  reliability  for  small 
samples. 

Finally  we  remark  that  there  are  many  other  nonparametric  applications  where  cross- 
validation  is  desirable,  such  as  nonparametric  regression,  discrimination,  hazard  analysis  and  spec¬ 
tral  analysis.  It  would  be  interesting  to  see  how  biased  and  unbiased  cross-validation  algorithms 
compare  in  these  settings. 

10.  Proofs  of  Results 

10.1.  Proofs  of  Theorems  3.1  and  3.3  in  Sections  3.1  and  3*3.1 

We  assume  that  conditions  (Cl)  and  [C2a)  are  satisfied.  Occasionally  in  the  proofs  we 
tacitly  assume  the  existence  of  higher  order  derivatives  in  /  when  we  wish  to  investigate  expli¬ 
citly  error  terms;  however  these  derivatives  are  not  required. 


10.1.1.  Expectation  of  the  Unbiased  Cross-Validation  Function 

Recall  the  definitions  of  7+  and  7_  in  equation  (4.1).  Since  K  is  symmetric,  it  is  easy  to 
show  that  for  c  >0,  7_(“^  )  ”  '7-h(c  ),  that  is,  7  is  symmetric.  Now 
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E  =  f  f  (x) 


X  X  +2h 

J  {y)dy  +  J  {y)<iy 

x-2h  "  X  ^ 
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fl+ic  )[/  (x-hc  )  +  f  (x  +kc  )]dc 
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(10.1) 
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by  two  change  of  variables,  the  symmetry  of  7,  a  Taylor’s  series  of  /  {x  ±hc  ),  and  integrating  by 
parts  (e.g.  J f  f"  —  -R  (/')).  It  is  not  hard  to  show  that 
2  11 

Jc  *  Tf+(c  )dc  =  —JK {w  )f(3  -to  )*  K (3  )d3dw  -  a**  ,  (10.2) 

0  ^  _1  _1 

which  equals  -1/2,  0,  and  15^2/^4  for  A:— 0,2, 4, 6,  respectively.  Hence, 

Enicij  )=-hR[f)+  (/»)  -  2W4A'i?  (/"')  +  0  (A^)  .  (10.3) 

Thus  (3.12)  follows  from  (10.3),  (3.11),  and  (2.3). 


10*1.2.  Variance  of  the  Unbiased  Cross-Validation  Function 

Next  we  find  the  variance  of  (3.11).  The  analysis  of  is  parallel  to  (10.1)  with 

l+{cf  rather  than  7+(c ).  Hence  E'i{cijf=  hR{'-i)R{f  )+  0(A®);  together  with  (10.3)  we 
have 

Var  Tf(c,-y  )  =  hR  {‘i)R  [f  )- h‘^R{f  f  +  O  [h^]  .  (10.4) 

For  simplicity  of  notation,  let  7,-y  =  7(<^i/  )■  Now  Gov  (7,-y  ,-7*/ )  =  0;  here  (and  from  now  on)  we 

assume  distinct  letters  represent  unequal  subscripts.  Let 

It^Jf"{^)f{xfdx-,  /2  =  i? (/ )i? (/»);  ■ 

Consider 


Elijlik  =  !  f  [x) 


/t^(— r^)/  {y)dy 


dx 


If  (O 


A/7+(c  )[/  (a: -Ac  )  +  /  (a:  4-Ac  )]dc 


dx 


With  (10.3)  we  have 


=  h^J  f  {x  fdx  -  jfiih  +  0{h^)  . 


Gov  (Tf,-y  ,'7,-*  )  =  h^l3-  juih  ®[/i-/2]+  O  (A  ®). 


Quantities  such  as  Gov  (-7,7  ,-7]^  )  also  equal  (10.7).  Now  it  is  well-known  that 


(10.5) 

(10.6) 

(10.7) 


[ES'l.;]  =  4”  (n-l)For  7,-y  -|-  n  (n  -l)(n  -2)Gov  (7,7  ,7,*  ) 

•  <y  ^ 


With  (10.4),  (10.7),  (10.8),  and  (3.11),  we  have 


(10.8) 
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Var  UCV{h  )=-!,+  „  („-9/6)  ,  (lO.g) 

n  n  “A  n 

which  explicitly  gives  the  remainder  terms  in  (3.13),  and  proving  Theorem  3.1. 


10*1.3*  Variance  of  Augmented  l/CV  Criterion 
Comparing  (3.11)  and  (3.22),  we  see  that 

Var  AC/CV(h)==  Var  UCV(h  )  + -^Varf;  /  (^.O  +  (SE^.V -S/  (=^.- ))-(10.10) 

n  n  n 


*  <; 


Since  Ef  (x,-  )*  =  //  (x  ,  we  have 


4yar  E  /  (x.- )  =  j-[R  if  if  f\  =  -h  . 

In  (10.10),  Cov  (7,-y  ,/  [xif ))  =  0.  Consider  the  n  (n  -1)  terms  for  which  k  =i  or  k  =j  : 


(10.11) 


Elijf  (xi)  =  2hf'y+(c) 


OO 

/[/  i^fdx 


=  -hB(/^/^)+  jhWli+  O(h^). 
Since  Ef  (x^-)  =  B  (/  ),  combining  (10.12)  and  (10.3), 

Cov  (^.-y  ,/  (Xi  ))  =  -his  +  Jf^2^h^ll,-l2j. 
Together  with  (10.10),  (10.9),  and  (10.11),  we  have  proven  Theorem  3.3. 


10*2*  Proof  of  Lemma  3*2  in  Section  3.2 


Clearly 


dc 


(10.12) 


f  f  )/  (*)/  (y)  dz  dy  dx 

=  (a:-Aw)dw]  dx 

after  a  change  of  variables.  Now  the  bracketed  term  may  be  approximated  by 
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if  )’  /  '’’(a:  )/»’■  )dw  +  0  (AP+2)  . 

Now  f  w' K^^\w)dw  ~0  if  i  <p  or  for  i  odd,  and  it  equals  (-1)^  p  !  for  i  =p  and 
(_l)p+2(p  ^2)1^2/2  for  p+2.  Hence  the  sum  collapses  to  f  ^^\x)  Squaring, 

integrating,  and  noting  that  (n  -l)/n  =  1  +  0  (n~^)  completes  the  proof.  We  remark  that  since 
/ijb  =  0  for  0< A:  <p  ,  the  error  is  actually  O  (A  ^ )  if  /  exists. 

10,3.  Proof  of  Theorem  3.2  in  Section  3.2 

The  analysis  of  the  moments  of  the  biased  cross-validation  function  is  similar  to  that  in  Sec¬ 
tion  10.1,  although  much  easier  since  BCV[h  )  involves  fewer  terms  and  because  more  “moments” 
of  (10.13)  below  vanish.  We  assume  conditions  (Cl)  and  (C2a  ,6  )  are  satisfied.  We  remark  that 
condition  (C  2b  )  is  necessary  for  Theorem  4.2  but  stronger  than  necessary  by  one  order  of  deriva¬ 
tive  for  Theorem  3.2.  From  (3.16)  define 

l-c 

(f>^{c)=  J^K”{w)K”{w -\-c)dw  Q<c  <2  (10.13) 

and  <f>J^c  )  for  -2<c  <0.  Again  (j>  is  symmetric.  Now 

2  1  1-w 

Jc*0_|.(c)dc  =  J  K**{w -\-c)d€dw  .  (10.14) 

0  -10 

For  A  =0,  observe  K**{w -\-c)dc  =  ~K*{w)  and  “J_i  K  ^{w)K  ”[w  )dw  =0  since 

K\±1)  =  0.  For  A  >2, 

l-to  1-to 

J  c  ^  K*^{w  H-c  )dc  =  A  (A  -l)  J  c  {w-\-c)dc  . 

0  0 

Noting  (for  even  m  ) 

1  »  11 

J  K  {s  )J  K  {w)dw  =  ~J  K  {s)dsj  w"*  K  {w)dw  =  —  -  , 

-1  -1  2  2 

and  integrating  by  parts,  we  see  that 
2 

Jc  *  )dc  =  0,  0,  12,  360)U2  for  A  =0,  2,  4,  6,  respectively.  (10,15) 

0 

The  analysis  proceeds  exactly  as  in  Section  10.1  with  0+  replacing  7+.  Let  ^  ^(^ij  )•  From 
(10.1),  it  follows  that 
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E<l>ij  =  if")  -  h^/i^R  (/"')  +  0  (A^)  ,  (10.16) 

from  which  (3.18)  follows  directly.  From  (10.4)  and  (10.16),  it  follows  that 

Var  0,-y  =  hR  MR  (/  )  +  O  (A  «)  .  (10.17) 

Corresponding  to  (10.5)  we  have 


E  </>,;■  =h^ff  (X) 


6 

2/0+[/  {^)+jh h  2/''(x  )+-^h^c  V  "  (x  )+..] dc 


12 


dx 


oo 

=  Ai'>//"(x)2/  (x)ix  +  o(A‘°). 

-OO 


Therefore  Cov  (0,-y  ,<^,t )  =  O  (A  ^°).  Following  (10.8), 


(10.18) 


y<^r  )]  =  WhR  MR  {f)+0{n^h^)+0  (n^h  , 

i<j  ^ 

which,  together  with  (3.17),  proves  equation  (3.19). 

The  asymptotic  normality  follows  from  Theorem  3.1  of  Hall  (1984)  for  AN  of  degenerate 
f/ -statistics.  Let  fi(t  )=EIC  —JC)/h  where  A=cn”^^®.  Then  A^(C|*y)  may  be  decomposed 
into 

+  fM(t)lK"(-^)  +  K"(-L^)]dt  -f^tfdt  . 

The  mean  comes  from  the  last  two  integrals  and  is  easily  checked  to  be  h^R  (/")  +  o  (A®).  The 
variance  comes  from  the  first  integral,  which  we  denote  by  II„  (a:,-  ,Xj- ).  Now  it  may  be  verified 
that  the  random  variable  B  [i/„  {X ^Y)  \  X]=0,  so  that  is  b,  degenerate  Martingale.  Using  the 
notation  of  Hall’s  (1984)  equation  (2.1),  calculations  similar  to  the  above  give 
=  h^R  {<I>)R  (/  ),  =  h^R  {<I>^)R  (/  ),  and  EG^  =  0  (A^);  therefore,  the  conditions  for 

Hall’s  Theorem  3.1  hold  and  BCV (A  )  is  AN . 


10.4*  Proof  of  Corollaries  3.2  and  3.3  in  Section  3*2  and  3.3.1 


From  (2.3)  and  Theorem  3.2  we  have  that 
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plim  [BCV{ch')/MISE{ch 

n  -+00 

plim  [AMISE  [ch  *  )/MISE  (ch 

n  —►00 

AMISE  (ch  ' )/ AMISE  (h 
so  that  MISE  (ch  )  >  MISE  (h  for  c  ^1  and  large  n  . 
one.  Then  Prob  {BCV(hBcv)  <  BCV(h’)}  —  1  as  n 


*)]=! 

*)  = 

^  5c 


(10.19) 


Suppose  /h*  does  not  converge  to 


—►00,  which  contradicts  the  consistency 


results  in  (10.19). 


The  proof  of  Corollary  3.3  was  first  given  by  Hall  (1983). 


10.5.  Proof  of  Lemma  4.1  in  Section  4.1 

As  before,  define  p^{c  )  from  (4.2)  when  0<  c  <2.  Then  it  may  be  shown  that 
2 

Jc*p+(c  )dc  =  0,  -15/^2^,  -105/i2A*4  for  ^  =0,2, 4, 6,  respectively, 

0  ^ 

Let  =  p(c,y ).  Omitting  details, 

Epi^  =hR(f)-  ^pih^R  (/")  +  ^p^p,h^R  (/»')  +  0  (A^) 
from  which  the  expectation  in  (4.4)  may  be  computed.  Then 

Var  p,.y  =  hR  (p)R  (f  )-h^R(f  f+  O  (A®) 

Cov  (p,-y  ,p,* )  =  A^/g  +  ^pih<>[I^-h]  +  0  (A «) 

Cov  (7.-y  ,Pik )  =  -h  «[/i-/2]  +  0  (A ») 

Cov  h.-y  ,p.-y  )  =  hR  (s/Tp)R  (f)  +  h^R(!f+0  (h% 

Now  the  variance  of  the  left  hand  side  of  (4.4)  may  be  expressed  as 

y  n  (n  -1)[  Vor  Tf.-y  +  Var  p,-y  ]  +  n  (n  -l)(n  -2)[Cov  (^.-y  ,7,*  )  +  Cov  (p.-y  ,p,*  )]  + 

n  (n  -l)C'ov  (7,-y  ,p,-y )  +  2n  (n  -l)(n  -2)0707;  (7,-y  ,p,*  )  .  (10.20) 

Evaluating  (10.20),  we  find 


Var  EEW<=.7  )+p( <=.;)]  =  [R  {l)+R  (p)+2R  (^p)]R  (f  ) 

i<j  ^ 


where  the  bracketed  term  may  be  written  as  R  (^H-p).  But  )  =  p^.(c  )/c  .  Hence 

dc 
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/7+(c  )p+(c  )rfc  =  Jc7+(c)7+'  (c)rfc  =-— J7_^(c  )2rfc 
0  0  ^0 

since  7+(2)=0.  Since  7  is  symmetric,  it  follows  that  R  (7+p)  =  R  (p),  which  completes  the  argu¬ 
ment. 

10*6.  Proof  of  Lemma  4.2  in  Section  4.2 

Briefly, 

2 

Jc  *  )dc  =  0,  0,  -60,  -2520^2  for  k  —0,2, 4, 6,  respectively, 

0 

£;^(c,y)  =  -5A^i?(n+0(A^) 

Vari;{ci,-)  =  hR{i^)R{f  )-^  0{h^) 

Cov  )  =  O  (A  ;  Cot;  .V'a  )  =  O  (A 

Cov  (0,y  ,V-.-y )  -  Ai?  {s/^)R  (/  )  +  O  (A®) 
and  i?  (^+^)  =  i?  (^)  as  above.  The  lemma  follows  directly. 
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Legends  for  Tables  and  Figures 

For  several  kernels,  asymptotic  ratios  of  “vertical”  and  “horizontal”  standard  devia¬ 
tion  of  unbiased  and  biased  cross-validation  estimates  and  smoothing  parameters  as 
given  in  expressions  (3.25)  and  (4.16). 

Summary  of  a  Monte  Carlo  experiment  using  a  triweight  kernel  averaged  shifted  histo¬ 
gram  estimator  with  standard  Gaussian  data.  The  sample  means  and  variances  of  the 
cross-validation  smoothing  parameters  are  given,  together  with  the  theoretical  stan¬ 
dard  deviations  given  in  Theorems  4.1  and  4.2.  The  theoretical  predictions  of  the 
standard  deviations  of  h^Qy  and  ki/Qy  are  denoted  by  tr^cy  and  <7i;cv ;  while  sample 
versions  are  indicated  by  a  circumflex. 

Summary  of  partial  results  of  a  Monte  Carlo  experiment  for  three  sampling  densities. 
Other  details  are  the  same  as  in  Table  II. 

Biased  (a)  and  unbiased  (b)  cross-validation  curves  for  a  histogram  estimator  of  10,000 
iV(5,l)  points.  The  vertical  lines  in  the  bottom  right  hand  corner  of  the  figures  indi¬ 
cate  theoretical  standard  deviations  computed  from  Theorems  3.1,  3.2,  and  3.3  as  dis¬ 
cussed  in  the  text.  The  predicted  optimal  MISE  smoothing  parameter  is  indicated  by 
a  star. 

Examples  of  biased  and  unbiased  cross-validation  curves  (on  a  logio  scale)  for  a  Gaus¬ 
sian  kernel  estimator  of  25  N{0,1)  points  in  (a)  and  400  points  in  (6).  The  exact 
mean  integrated  squared  error  is  shown  by  the  dotted  line.  The  corresponding  cross- 
validation  density  estimates  are  shown  in  (c  ),  along  with  the  true  density  (dotted 
line). 

Histograms  of  biased  and  unbiased  cross-validation  smoothing  parameters  for  AT  (0,1) 
samples  of  several  sizes  using  an  ASH  triweight  kernel  estimator.  The  BCV  histo¬ 
gram  is  in  the  positive  direction  while  the  UGV  histogram  is  in  the  negative  direction. 
The  location  of  h^js£  is  indicated  by  a  star  on  the  horizontal  axis.  These  figures  are 
discussed  more  fully  in  Section  6.1. 

Scatter  plots  of  the  various  smoothing  parameters  are  shown  for  the  same  Monte  Carlo 
data  as  in  Figure  3  with  sample  sizes  n  =400  and  n  =25600. 

Using  the  same  samples  as  in  Figure  4,  scatter  plots  of  the  \o^iq{ISE  )  corresponding  to 
the  various  smoothing  parameters  are  shown.  The  first  row  is  for  the  n  =400  samples 
and  the  second  row  is  for  the  n  =25600  samples.  The  diagonal  line  indicates  y  =x  . 

Similar  to  Figure  3,  except  with  n=1600  from  Cauchy,  Lognormal,  and  Mixture  Den¬ 
sities. 

Biased  and  unbiased  cross-validation  density  estimates  of  1600  points  from  a  Lognor¬ 
mal  distribution. 

Biased  and  unbiased  cross-validation  density  estimates  of  400  points  from  a  Mixture 
distribution. 

For  the  samples  in  Figure  4,  scatter  plots  of  hjsE  and  the  sample  standard  deviation 
for  each  sample. 

On  a  square  root  scale,  triweight  ASH  estimates  of  the  LRL  data,  with  m  =  1,  2,  and 
4.  The  bumps  found  by  Good  and  Gaskins  with  a  penalized-likelihood  density  estima¬ 
tor  are  indicated  by  horizontal  lines  above  the  bump. 
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Table  I.  Asymptotic  Ratio  of  “Vertical”  Standard  Deviations  of  UCV  and  BCV  Esti¬ 
mators 


niR  (fl^)V2/4 

ratio 

Eqnf3.25) 

m  =2  (biweight) 

1.0033 

.0827 

12.13 

1.2352 

_ 

_ 

m  =3  (triweight) 

.0921 

11.65 

1.2047 

.2420 

4.98 

m  =4 

.1013 

11.20 

1.2195 

.2550 

4.78 

m  =5 

1.1859 

.1092 

10.86 

1.2458 

.2685 

4.64 

N(0,1)  (“m  =oo”) 

.0715 

8.92 

.6178 

.1558 

3.96 

Table  II.  Monte  Carlo  Results  or  Triweight  Kernel  ASH  Estimates  of  iV(0,l)  Data 


n 

m 

Bl 

^9 

^BCV 

^ucv 

ratio 

^BCV 

^ucv 

25 

1.775 

- 

1.907 

.4732 

100 

1.309 

1.499 

1.262 

WBimm 

.3122 

400 

.976 

1.041 

.935 

.0414 

.2060 

1600 

.732 

.753 

.683 

WmSm 

.1359 

6400 

.552 

.561 

.535 

.0246 

.1054 

4.27 

K  jSm 

.0896 

25600 

.416 

.419 

.416  ' 

.0128 

.0549 

4.28 

RWrI 

.0591 

Table  HI.  Partial  Monte  Carlo  Results  for  Other  Densities 


density 

n 

m 

^BCV 

hucv 

^BCV 

^ucv 

ratio 

^BCV 

^ucv 

Cauchy 

400 

1.012 

1.056 

.1292 

.2538 

1.96 

.1492 

jin 

.740 

.815 

.751 

mm 

.1448 

2.65 

.0198 

.549 

.551 

mBi 

3.28 

.0130 

,0649 

.411 

.418 

.415 

isii 

.0371 

2.57 

.0086 

.0428 

Lognormal 

400 

.324 

.326 

.1052 

.0776 

0.74 

.0050 

.0248 

.218 

.212 

.0331 

1.21 

.0033 

,0163 

.151 

.184 

,150 

.0137 

1.52 

.0022 

.0108 

.107 

.121 

.107 

.0048 

.0127 

2.63 

.0014 

.0071 

Mixture 

400 

.612 

- 

.618 

- 

.1512 

- 

.0167 

.0830 

.443 

.504 

.434 

.0749 

.0548 

.327 

.345 

.320 

.0425 

2.75 

.0073 

.0361 

.245 

.252 

.242 

.0294 

4.34 

.0048 

.0238 

w 
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